We would want to build a regression model with more than one predictor because an outcome typically depends on more than one variable, so additional predictors can tell us more about the outcome. Including multiple predictors can improve a model's fit and make it more useful for prediction and forecasting.
Modeling a car's miles per gallon in the city (Y) by the make of the car: mu = B_0 + B_1*X_1 + B_2*X_2 + B_3*X_3, where X_1, X_2, and X_3 are indicator variables for Kia, Subaru, and Toyota, respectively.
There is no indicator term for the Ford category because Ford is our baseline or reference level of the make variable; a Ford is represented by X_1 = X_2 = X_3 = 0.
B_2 represents the typical difference in gas mileage between a Subaru and a Ford, i.e. the change in mu for a 1-unit increase in X_2 (going from not-Subaru to Subaru).
B_0 represents the typical gas mileage for a Ford.
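As a quick sketch of how this indicator coding works, base R's model.matrix() shows the columns a make factor would produce (the data here are hypothetical, for illustration only):

```r
# Hypothetical car makes; "Ford" listed first so it becomes the baseline level
make <- factor(c("Ford", "Kia", "Subaru", "Toyota"),
               levels = c("Ford", "Kia", "Subaru", "Toyota"))

# model.matrix() expands the factor into indicator columns;
# note there is no makeFord column -- a Ford is all zeros
model.matrix(~ make)
```

The Ford row has 0 in every indicator column, so its expected mileage is just the intercept B_0.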
mu = B_0 + B_1*X_1 + B_2*X_2, where X_1 is the number of days a tomato has been growing and X_2 is an indicator for the Roma variety (Mr. Stripey is the reference level).
Mr. Stripey is our reference level, so B_0 is the typical weight of a Mr. Stripey tomato at day 0. B_1 measures the typical change in a tomato's weight for a 1-unit increase in X_1, i.e. for each additional day of growth. B_2 measures the typical difference in weight between a Roma and a Mr. Stripey tomato that have grown for the same number of days.
If B_2 were equal to 0, then mu = B_0 + B_1*X_1 for both varieties: being a Roma would tell us nothing about weight, and the expected weight of any tomato would be that of a Mr. Stripey that has grown for X_1 days. Weight is the outcome, and variety and days of growth are the predictors.
For X_1 and X_2 to interact means that X_1 moderates the relationship between X_2 and mu. In tomato terms, the number of days a tomato has been growing (X_1) moderates the relationship between the Roma category (X_2) and the tomato's weight in grams (Y).
B_3 is the interaction coefficient; it captures the difference in slopes between the two varieties, i.e. how much the per-day change in weight differs for a Roma versus a Mr. Stripey tomato.
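To make the slope arithmetic concrete, here is a small sketch with made-up coefficient values (not estimates from any fitted model):

```r
# Hypothetical coefficients for mu = b0 + b1*X1 + b2*X2 + b3*X1*X2
b0 <- 20   # baseline weight (g) of a Mr. Stripey at day 0
b1 <- 2.0  # per-day growth slope for Mr. Stripey (X2 = 0)
b2 <- -5   # Roma intercept shift
b3 <- 0.5  # interaction: extra per-day growth for Roma

# Slopes for each variety: Roma's slope is b1 + b3
c(stripey_slope = b1, roma_slope = b1 + b3)
# -> stripey_slope = 2.0, roma_slope = 2.5
```

With these made-up numbers, B_3 = 0.5 means a Roma gains half a gram per day more than a Mr. Stripey.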
Y = shoe size, X_1 = children's age in years, X_2 = children's swimming knowledge.
Generally speaking, adding predictors improves models because the outcome may be correlated with more than one predictor. Adding predictors incorporates flexibility that improves upon models with high bias.
It is possible to overfit. When a model has high variance and low bias, it aligns so closely with the sample data that it also reflects the noise or error in those data, making it actually worse than a model with in-between variance and in-between bias.
For a model of children's shoe size I would add children's height in cm, assuming that height and foot length are roughly proportional. It's also data that is feasible and probably fair to collect.
I would remove children's knowledge of swimming, as there is no reasonable causal pathway by which knowing how to swim would lead to variation in shoe size.
Good models capture the central tendency and range of the data they were built with. Good models have also undergone training and testing, and they have bias and variance that are neither too high nor too low.
Bad models can be overly simple (high bias, low variance) or overly complex (low bias, high variance), meaning they either miss key features of the data or capture too much noise/error.
Techniques to assess models include visualization, cross-validation, and ELPD. Visualizations are used to evaluate predictive accuracy and compare models: with pp_check() we can plot posterior predictive simulations against the observed data and against each other. Cross-validation numerically assesses the posterior predictive quality of models; k-fold cross-validation trains the model on some folds of the data and tests it on the held-out fold. ELPD (expected log-predictive density) is another numerical way to assess predictive accuracy: a larger expected log posterior predictive density corresponds to more accurate posterior predictions of Y.
The bias-variance tradeoff is the challenge of using enough predictors to get useful information about Y without including too much information. Too few predictors gives models with low variance across samples but high bias. Too many predictors captures noise and error, making the model unstable: high variance and low bias.
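A minimal base-R sketch of this tradeoff, using simulated data (the seed and sample sizes are arbitrary choices for illustration): a degree-9 polynomial chases noise in the training half and typically predicts the held-out half worse than a simple line.

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 4)   # truly linear signal plus noise
train <- 1:50
test  <- 51:100

# Appropriate simple model vs. an overly flexible degree-9 polynomial
fit_lin  <- lm(y ~ x, subset = train)
fit_poly <- lm(y ~ poly(x, 9), subset = train)

# Mean squared prediction error on the held-out data
mse <- function(fit) {
  mean((y[test] - predict(fit, newdata = data.frame(x = x[test])))^2)
}
c(linear = mse(fit_lin), degree9 = mse(fit_poly))
```

The polynomial fits the training points more closely, but that flexibility is spent on noise, which is exactly the high-variance failure mode described above.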
Load packages for applied exercises.
#loading libraries for exercises
library(bayesrules)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.1
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores())
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
library(tidybayes)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(broom.mixed)
Using penguins_bayes to build models of body_mass_g (Y). The average penguin weighs between 3500 and 4500 g, with an estimated standard deviation of 250 g. X = species (Adelie, Chinstrap, Gentoo), a categorical predictor. We will work with a two-level version of this predictor in penguin_data.
#loading alternative penguin data
penguin_data <- penguins_bayes |>
filter(species %in% c("Adelie", "Gentoo"))
ggplot(data = penguin_data, aes(y = body_mass_g, x = flipper_length_mm, color = species)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
For both species, as flipper length increases, body mass increases. For Gentoo the increase is more rapid, suggesting that species may moderate the relationship between flipper length and body mass.
penguin_model <- stan_glm(
body_mass_g ~ flipper_length_mm + species,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250), # midpoint of the average weight range given in the chapter
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE), #1/250
chains = 4, iter = 5000*2, seed = 8000)
MCMC diagnostics
#Store 4 chains for each parameter in 1 data frame for later
penguin_model_df <- as.data.frame(penguin_model)
#visual diagnostics
#trace and density plots
mcmc_trace(penguin_model, size = 0.1)
# Density plots of parallel chains
mcmc_dens_overlay(penguin_model)
#chains look stable and the density plots from the parallel chains look consistent
#numerical diagnostics
# Effective sample size ratio and Rhat
neff_ratio(penguin_model)
## (Intercept) flipper_length_mm speciesGentoo sigma
## 0.47355 0.46930 0.47975 0.71670
rhat(penguin_model)
## (Intercept) flipper_length_mm speciesGentoo sigma
## 1.000186 1.000184 1.000119 1.000060
The R-hat values are very close to 1, meaning the chains are stable and mix quickly. The effective sample size ratios (roughly 0.47 to 0.72) are all well above 0.1, so the chains are sampling efficiently.
tidy(penguin_model, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.80) |>
select(-std.error)
## # A tibble: 5 × 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) -4379. -5262. -3487.
## 2 flipper_length_mm 42.5 37.8 47.1
## 3 speciesGentoo 217. 78.2 361.
## 4 sigma 393. 373. 416.
## 5 mean_PPD 4316. 4273. 4358.
flipper_length_mm is positively associated with body_mass_g, as is the Gentoo species: holding flipper length constant, a Gentoo is typically about 217 g heavier than an Adelie.
penguin_interaction_model <- stan_glm(
body_mass_g ~ species + flipper_length_mm + species:flipper_length_mm,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
pp_check(penguin_interaction_model)
The plot shows that the model captures the range and central tendency of the data well: both the observed data and the 50 simulated posterior predictive datasets show at least two peaks. The model doesn't appear to overfit the data, and the use of multiple predictors gives a better model than flipper length alone.
For comparison here is the pp_check of a non-interaction model with one predictor, flipper length.
penguin_one_model <- stan_glm(
body_mass_g ~ flipper_length_mm,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
pp_check(penguin_one_model)
tidy(penguin_interaction_model, effects = c("fixed", "aux"))
## # A tibble: 6 × 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) -2893. 875.
## 2 speciesGentoo -3341. 1314.
## 3 flipper_length_mm 34.7 4.60
## 4 speciesGentoo:flipper_length_mm 17.3 6.42
## 5 sigma 387. 17.0
## 6 mean_PPD 4316. 33.0
posterior_interval(penguin_interaction_model, prob = 0.80,
pars = "speciesGentoo:flipper_length_mm")
## 10% 90%
## speciesGentoo:flipper_length_mm 9.174402 25.60938
penguin_data |> drop_na() |>
add_epred_draws(penguin_interaction_model, ndraws = 200) |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_line(aes(y = .epred, group = paste(species, .draw)), alpha = 0.1)
Based on the summary and this visualization, I would say the interaction term is necessary. Species moderates the relationship between flipper_length_mm and body_mass_g: being Gentoo (blue) as opposed to Adelie strengthens the positive relationship between flipper_length_mm and body_mass_g. The 80% posterior credible interval for the interaction coefficient lies entirely above 0, ranging from 9.17 to 25.61.
Three predictors and no interactions: flipper_length_mm, bill_length_mm, bill_depth_mm.
penguin_model_3 <- stan_glm(
body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm ,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
# Confirm prior specification
prior_summary(penguin_model_3)
# Check MCMC diagnostics
mcmc_trace(penguin_model_3)
mcmc_dens_overlay(penguin_model_3)
mcmc_acf(penguin_model_3)
posterior_interval(penguin_model_3, prob = 0.95)
## 2.5% 97.5%
## (Intercept) -7249.42574 -5028.55451
## flipper_length_mm 26.11961 38.23013
## bill_length_mm 55.62631 87.56523
## bill_depth_mm 27.55968 80.22603
## sigma 312.67673 370.20147
penguin_model_4 <- stan_glm(
body_mass_g ~ .,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
posterior_interval(penguin_model_4, prob = 0.95)
## 2.5% 97.5%
## (Intercept) 16967.071808 170943.720217
## speciesGentoo 332.968842 852.046816
## islandDream -123.291986 68.918310
## islandTorgersen -141.008733 58.292925
## year -85.827030 -8.837732
## bill_length_mm 8.169976 35.197786
## bill_depth_mm 18.796586 90.250796
## flipper_length_mm 7.792649 19.496381
## above_average_weight1 387.550791 591.059962
## sexmale 292.640265 470.133041
## sigma 219.832953 262.245305
#model 1
penguin_model_1 <- stan_glm(
body_mass_g ~ flipper_length_mm,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
#model 2
penguin_model_2 <- stan_glm(
body_mass_g ~ species,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
#model 3
penguin_model_3 <- stan_glm(
body_mass_g ~ flipper_length_mm + species,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
#model 4
penguin_model_4 <- stan_glm(
body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
data = penguin_data, family = gaussian,
prior_intercept = normal(4000, 250),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(0.004, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 8000)
pp_check(penguin_model_1)
pp_check(penguin_model_2)
pp_check(penguin_model_3)
pp_check(penguin_model_4)
#creating data set without missing data
penguins_comp <- penguin_data |>
select(flipper_length_mm, body_mass_g, species,
bill_length_mm, bill_depth_mm) |>
na.omit()
#cross validation of four models
set.seed(8000)
prediction_summary_cv(model = penguin_model_1, data = penguins_comp, k = 10)
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 246.8452 0.6274958 0.5357143 0.9285714
## 2 2 304.5235 0.7694913 0.4074074 0.9629630
## 3 3 248.7804 0.6151063 0.5555556 0.9629630
## 4 4 222.7751 0.5595664 0.5357143 0.9642857
## 5 5 348.6981 0.8869174 0.3333333 0.9259259
## 6 6 215.6457 0.5382611 0.6666667 0.9259259
## 7 7 300.2348 0.7666828 0.4285714 0.9285714
## 8 8 304.6348 0.7619095 0.4444444 1.0000000
## 9 9 203.3990 0.5042036 0.5925926 0.9629630
## 10 10 226.4402 0.5581455 0.5714286 0.9642857
##
## $cv
## mae mae_scaled within_50 within_95
## 1 262.1977 0.658778 0.5071429 0.9526455
set.seed(8000)
prediction_summary_cv(model = penguin_model_2, data = penguins_comp, k = 10)
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 360.4104 0.7543538 0.4285714 0.8928571
## 2 2 403.6032 0.8329245 0.3333333 1.0000000
## 3 3 437.2558 0.9064028 0.2962963 1.0000000
## 4 4 286.3682 0.5941468 0.5714286 0.9642857
## 5 5 441.2444 0.9101616 0.4814815 0.9259259
## 6 6 365.2359 0.7502957 0.4814815 0.9629630
## 7 7 280.3658 0.5804143 0.6071429 0.9285714
## 8 8 526.8396 1.1137135 0.2592593 1.0000000
## 9 9 311.6731 0.6344041 0.5555556 0.9629630
## 10 10 317.0627 0.6399031 0.5357143 1.0000000
##
## $cv
## mae mae_scaled within_50 within_95
## 1 373.0059 0.771672 0.4550265 0.9637566
set.seed(8000)
prediction_summary_cv(model = penguin_model_3, data = penguins_comp, k = 10)
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 261.6374 0.6649825 0.5000000 0.9285714
## 2 2 317.8054 0.7972312 0.4074074 0.9629630
## 3 3 308.3408 0.7605406 0.3703704 0.9629630
## 4 4 208.8396 0.5273125 0.6071429 0.9642857
## 5 5 399.5042 1.0394804 0.2962963 0.9629630
## 6 6 226.1543 0.5715921 0.6666667 0.9259259
## 7 7 324.0330 0.8287271 0.4285714 0.9285714
## 8 8 331.1047 0.8313187 0.2962963 1.0000000
## 9 9 225.0195 0.5639652 0.5925926 0.9629630
## 10 10 207.6343 0.5124322 0.6071429 0.9642857
##
## $cv
## mae mae_scaled within_50 within_95
## 1 281.0073 0.7097582 0.4772487 0.9563492
set.seed(8000)
prediction_summary_cv(model = penguin_model_4, data = penguins_comp, k = 10)
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 251.3421 0.7481516 0.4285714 0.8928571
## 2 2 214.7109 0.6247227 0.5185185 1.0000000
## 3 3 232.5114 0.6652209 0.5185185 1.0000000
## 4 4 180.3222 0.5208608 0.6071429 0.8928571
## 5 5 231.0877 0.6818039 0.4814815 0.8888889
## 6 6 149.1957 0.4362503 0.5925926 0.8888889
## 7 7 280.3003 0.8412392 0.4285714 0.8928571
## 8 8 220.2031 0.6302804 0.5555556 1.0000000
## 9 9 237.0537 0.6880010 0.4814815 0.9629630
## 10 10 158.7289 0.4522308 0.6071429 0.9642857
##
## $cv
## mae mae_scaled within_50 within_95
## 1 215.5456 0.6288762 0.5219577 0.9383598
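The cross-validation summaries above can be complemented with the ELPD comparison described earlier. This is a sketch only, assuming the four fitted models are still in memory; loo() and loo_compare() come from the loo package loaded with rstanarm, and the output is not shown here because it was not run.

```r
# Estimate expected log-predictive density (ELPD) for each model
loo_1 <- loo(penguin_model_1)
loo_2 <- loo(penguin_model_2)
loo_3 <- loo(penguin_model_3)
loo_4 <- loo(penguin_model_4)

# Rank the models: elpd_diff = 0 marks the model with the best ELPD,
# and negative values show how far each other model falls behind
loo_compare(loo_1, loo_2, loo_3, loo_4)
```

A larger ELPD corresponds to more accurate posterior predictions of body_mass_g, so the model ranked first here should agree broadly with the MAE results above.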